[CI/Build] Basic server correctness test #237
Conversation
This test is failing today. Something's been broken over the weekend. The exception is:
I don't understand why the build was skipped. I didn't try to skip it.
A couple of notes:
Force-pushed from b00a664 to ba3866a
After rebasing this branch onto main, the test is passing for me with the single Mistral model:
Per Slack discussions, I've updated the test to include most of the remaining models in the test execution (some need to be skipped if the model requires a GPU device capability greater than that available on the GPU under test). It was also necessary to ignore "special tokens" output by the HuggingFace runner for a few prompts in a number of models. Simply converting any special token to an empty string worked for all but one test:
The failure is the same for both executions with the same model:
The HuggingFace response in this case without the hack had this error:
So, it's not really related to the special tokens.
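For reference, the special-token hack amounts to something like this minimal sketch (assuming a `transformers`-style tokenizer; the helper name is illustrative, not the actual code in the test):

```python
def scrub_special_tokens(tokens: list[str], tokenizer) -> list[str]:
    """Replace any special token (e.g. <s>, </s>, <unk>) with an empty string
    so the HuggingFace output can be compared fairly with the server output."""
    special = set(tokenizer.all_special_tokens)
    return ["" if tok in special else tok for tok in tokens]
```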
Introducing an end-to-end test case that verifies basic correctness of the vllm server by comparing the tokens output by the vllm OpenAI server with tokens generated by the HuggingFace model created with `AutoModelForCausalLM.from_pretrained()`. Updates `HfRunner()` to accept a HuggingFace access token so that restricted-access models can be retrieved. The new `HfRunnerNM.generate_greedy_logprobs_nm_use_tokens()` allows us to compare the HuggingFace-generated results (which report logprobs with token ids) with those from the vllm OpenAI server (which reports logprobs with token text). This included a new `_decode_token_by_position_index()` method to properly calculate the token string by using a lookback on the generated tokens list. Enhances the output of the `check_logprobs_close()` function to provide more details about the failing tokens. Adds the test to the appropriate `skip-*.txt` files so that this long-running test won't be run automatically during dev push workflows.
Test other models. Skip execution if the model requires a GPU device capability greater than that available on the current device (reusing the approach from test_gptq_marlin.py). Adds a hack to ignore special tokens after decoding the HuggingFace response so that we can fairly compare it with the vllm server response.
This model fails the test with a specific prompt; to be addressed later.
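The GPU-capability gate mentioned above (reusing the approach from `test_gptq_marlin.py`) boils down to a check roughly like the sketch below; the capability table here is a made-up placeholder, not the real list from the test:

```python
import pytest
import torch

# Illustrative minimum compute capability (major, minor) per model;
# the real values live in the test itself.
MIN_CAPABILITY = {
    "example/marlin-quantized-model": (8, 0),  # hypothetical: Marlin kernels want Ampere+
}

def maybe_skip_for_capability(model_name: str) -> None:
    """Skip the test when the current GPU is older than the model requires."""
    required = MIN_CAPABILITY.get(model_name)
    if required is not None and torch.cuda.get_device_capability() < required:
        pytest.skip(f"{model_name} needs compute capability >= {required}, "
                    f"got {torch.cuda.get_device_capability()}")
```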
Force-pushed from a9451b9 to 67acb7f
I've rebased this to the latest nm-vllm/main. At this point, the test includes a number of models, but skips a few that don't work with HuggingFace out of the box, and one that fails the test for a specific prompt. I've got Asana tickets to address these later, so that we can get this committed and running now.
cool.
@derekk-nm could you add a README in "neuralmagic" or "neuralmagic/tests" that outlines:
thanks
Entries have been moved to the bug report, where failing models will be tracked. Removed some additional models that do not work in the build/test env (until a resolution is found). Expanded the doc on the test case. Added a README for the *_skip.txt files.
adding tests/basic_correctness/test_basic_server_correctness.py to skip-for-remote-push-tmp.txt
Introducing an end-to-end test case that verifies basic correctness of the vllm server by comparing the tokens output by the vllm OpenAI server with tokens generated by the HuggingFace model created with `AutoModelForCausalLM.from_pretrained()`.

Updates `HfRunner()` to accept a HuggingFace access token so that restricted-access models can be retrieved.
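As a rough illustration of what accepting an access token involves (a sketch only, not the actual `HfRunner()` change; the `token=` keyword assumes a recent `transformers` release, older versions spell it `use_auth_token=`):

```python
import os
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_restricted_model(model_name: str):
    """Load a gated/restricted model using an access token from the environment."""
    access_token = os.getenv("HF_TOKEN")
    model = AutoModelForCausalLM.from_pretrained(model_name, token=access_token)
    tokenizer = AutoTokenizer.from_pretrained(model_name, token=access_token)
    return model, tokenizer
```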
The new `HfRunnerNM.generate_greedy_logprobs_nm_use_tokens()` allows us to compare the HuggingFace-generated results (which report logprobs with token ids) with those from the vllm OpenAI server (which reports logprobs with token text). This included a new `_decode_token_by_position_index()` method to properly calculate the token string by using a lookback on the generated tokens list.

Enhances the output of the `check_logprobs_close()` function to provide more details about the failing tokens.

Adds the test to the appropriate `skip-*.txt` files so that this long-running test won't be run automatically during dev push workflows.
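For context, the lookback idea behind `_decode_token_by_position_index()` works roughly like the sketch below (my own approximation, not the method's actual implementation):

```python
def decode_token_by_position_index(token_ids: list[int], index: int,
                                   tokenizer, lookback: int = 4) -> str:
    """Recover the text of the token at `index` by decoding a short window of
    preceding tokens and taking the difference, so tokenizers that fold
    whitespace into neighbouring tokens still produce the right string."""
    start = max(0, index - lookback)
    with_token = tokenizer.decode(token_ids[start:index + 1])
    without_token = tokenizer.decode(token_ids[start:index])
    return with_token[len(without_token):]
```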
To run this test manually (this assumes that you've downloaded and installed the local nm-vllm package with `pip install -e .[sparse]` and all of the packages from `requirements-common.txt`, `requirements-cuda.txt`, and `requirements-dev.txt`):

- Set the `HF_TOKEN` environment variable to a valid HuggingFace access token.
- From the `nm-vllm` directory, run:
  `python3 -m pytest --forked tests/basic_correctness/test_basic_server_correctness.py -k test_models_on_server`

[Note that when running this from my local env I needed to include the `--import-mode importlib` option to work around a known issue in vllm.]